24 research outputs found
Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications
Analysis and predictive modeling of massive datasets is a significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and uneven class distributions (imbalanced data). Such uncertainties introduce bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning (CSL) method for the classification of imperfect data. Most traditional classification approaches perform poorly in an environment with imperfect data. We propose combining CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the performance measures best suited to imperfect data and address real problems in quality control and business analytics.
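The core idea of cost-sensitive SVM learning can be sketched with scikit-learn's per-class misclassification costs. This is a minimal illustration, not the paper's implementation; the dataset and the 9:1 cost ratio below are assumptions for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced toy data: roughly 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = SVC(kernel="rbf").fit(X_tr, y_tr)
# Cost-sensitive variant: a misclassified minority example costs 9x more.
costed = SVC(kernel="rbf", class_weight={0: 1, 1: 9}).fit(X_tr, y_tr)

for name, clf in [("plain SVM", plain), ("cost-sensitive SVM", costed)]:
    print(f"{name}: minority F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```

The asymmetric `class_weight` shifts the decision boundary toward the majority class, which typically trades some overall accuracy for better minority-class detection.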
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, and with imbalance in the classes of
interest, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
the development of specialized data-preprocessing and classification techniques.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expectation-maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, more accurate, and more robust classification results.
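The impute-then-classify pipeline described above can be approximated with standard tools. As a hedged sketch, scikit-learn's `IterativeImputer` (which iterates regression models over the features) stands in for the EM-style imputation, followed by a class-weighted SVM; this is not the authors' multilevel implementation, and the missingness rate is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=1)
rng = np.random.default_rng(1)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan   # knock out ~10% of entries

model = make_pipeline(
    IterativeImputer(max_iter=10, random_state=1),  # regression-based imputation
    SVC(kernel="rbf", class_weight="balanced"),     # cost-sensitive SVM
)
model.fit(X_missing, y)
print("training accuracy:", round(model.score(X_missing, y), 3))
```

Fitting imputer and classifier in one pipeline keeps the imputation model consistent between training and prediction.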
Engineering Fast Multilevel Support Vector Machines
The computational complexity of solving a nonlinear support vector machine (SVM) is prohibitive on large-scale data. The issue becomes especially acute when the data exhibit additional difficulties such as highly imbalanced class sizes. Nonlinear kernels typically produce significantly higher classification quality than linear kernels, but they introduce extra kernel and model parameters that require computationally expensive fitting; this improves quality but dramatically reduces performance. We introduce a generalized fast multilevel framework for regular and weighted SVM and discuss several versions of its algorithmic components that lead to a good trade-off between quality and time. Our framework is implemented using PETSc, which allows easy integration with scientific computing tasks. The experimental results demonstrate significant speedup compared to state-of-the-art nonlinear SVM libraries. Our source code, documentation, and parameters are available at https://github.com/esadr/mlsvm
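The coarsen/refine idea behind multilevel SVM training can be sketched in a few lines: train a nonlinear SVM on a coarsened (here, simply subsampled) dataset, then retrain on the coarse support vectors plus their nearest fine-level neighbors. This toy two-level loop only illustrates the principle; the actual framework uses multigrid-style coarsening and is implemented in PETSc.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, random_state=0)
rng = np.random.default_rng(0)

# Coarsen: train on a small random sample of the fine-level data.
coarse = rng.choice(len(X), size=500, replace=False)
svm = SVC(kernel="rbf").fit(X[coarse], y[coarse])

# Refine: keep the coarse support vectors and add their fine-level
# neighbors, then retrain on this focused subset instead of all 5000 points.
sv_idx = coarse[svm.support_]
nn = NearestNeighbors(n_neighbors=5).fit(X)
_, neigh = nn.kneighbors(X[sv_idx])
refine = np.unique(np.concatenate([sv_idx, neigh.ravel()]))
svm_refined = SVC(kernel="rbf").fit(X[refine], y[refine])
print("refinement set size:", len(refine), "of", len(X))
```

The expensive kernel solve only ever sees a fraction of the data, which is where the speedup over training on the full set comes from.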
Predictive Models for Bariatric Surgery Risks with Imbalanced Medical Datasets
Bariatric surgery (BAR) has become a popular treatment for type 2 diabetes mellitus (T2DM), which is among the most critical obesity-related comorbidities. Patients who have bariatric surgery are exposed to complications after surgery. Furthermore, the mid- to long-term complications after bariatric surgery can be deadly and increase the complexity of managing the safety of these operations and healthcare costs. Current studies on BAR complications have mainly used risk scoring to identify patients who are more likely to have complications after surgery. However, these studies do not take into consideration the imbalanced nature of the data, where the class of interest (patients who have complications after surgery) is relatively small. We propose the use of imbalanced classification techniques to tackle the imbalanced bariatric surgery data: the synthetic minority oversampling technique (SMOTE), random undersampling, and ensemble learning classification methods including Random Forest, Bagging, and AdaBoost. Moreover, we improve classification performance by using Chi-squared, Information Gain, and Correlation-based Feature Selection (CFS) techniques. We study the Premier Healthcare Database with a focus on the most frequent complications, including diabetes, angina, heart failure, and stroke. Our results show that the ensemble learning-based classification techniques, combined with any of the feature selection methods mentioned above, are the best approach for handling the imbalanced nature of the bariatric surgical outcome data. In our evaluation, we find a slight preference for the SMOTE method over random undersampling. These results demonstrate the potential of machine-learning tools as clinical decision support in identifying risks and outcomes associated with bariatric surgery, and their effectiveness in reducing surgery complications as well as improving patient care.
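The SMOTE mechanism (new minority samples interpolated between a minority point and one of its minority-class nearest neighbors) can be sketched directly, feeding an ensemble classifier. The `imbalanced-learn` library provides a production SMOTE; the minimal `smote_like` helper below is a hypothetical toy version written only to show the mechanism, on synthetic data rather than the Premier Healthcare Database.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by interpolation."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh = nn.kneighbors(X_min)                 # column 0 is the point itself
    base = rng.integers(0, len(X_min), n_new)       # pick random minority points
    partner = neigh[base, rng.integers(1, k + 1, n_new)]  # one of their k neighbors
    gap = rng.random((n_new, 1))                    # random position on the segment
    return X_min[base] + gap * (X_min[partner] - X_min[base])

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_min = X[y == 1]
X_new = smote_like(X_min, n_new=len(X) - 2 * len(X_min))  # balance the classes
X_bal = np.vstack([X, X_new])
y_bal = np.concatenate([y, np.ones(len(X_new), dtype=int)])
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
```

Unlike random undersampling, no majority examples are discarded, which is consistent with the slight preference for SMOTE reported above.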
A Weighted Support Vector Machine Method For Control Chart Pattern Recognition
Manual inspection and evaluation of quality control data is a tedious task that requires the undistracted attention of specialized personnel. On the other hand, automated monitoring of a production process is necessary not only for real-time product quality assessment but also for potential machinery malfunction diagnosis. For this reason, control chart pattern recognition (CCPR) methods have received a lot of attention over the last two decades. Current state-of-the-art control monitoring methodology includes K charts, which are based on support vector machines (SVM). Although K charts have some profound benefits, their performance deteriorates when the learning examples for the normal class greatly outnumber those for the abnormal class. Such problems are termed imbalanced and represent the vast majority of real-life control pattern classification problems. The original SVM demonstrates poor performance when applied directly to these problems. In this paper, we propose the use of weighted support vector machines (WSVM) for automated process monitoring and early fault diagnosis. We show the benefits of WSVM over traditional SVM and compare them under various fault scenarios. We evaluate the proposed algorithm in binary and multi-class environments for the most popular abnormal quality control patterns, as well as in a real application from the wafer manufacturing industry. © 2014 Elsevier Ltd. All rights reserved.
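A weighted SVM for an imbalanced pattern-recognition task of this kind can be sketched on synthetic control-chart windows. The generator below (normal windows versus a rare upward mean-shift pattern) and the inverse-frequency class weights are illustrative assumptions; the paper evaluates the standard abnormal CCPR patterns and real wafer data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_norm, n_shift, w = 950, 50, 32          # imbalanced: few abnormal examples

normal = rng.normal(0.0, 1.0, (n_norm, w))
shift = rng.normal(0.0, 1.0, (n_shift, w))
shift[:, w // 2:] += 2.0                  # upward mean shift mid-window

X = np.vstack([normal, shift])
y = np.concatenate([np.zeros(n_norm), np.ones(n_shift)]).astype(int)

# WSVM-style weighting: abnormal-class errors cost inversely to class frequency.
wsvm = SVC(kernel="rbf", class_weight={0: 1.0, 1: n_norm / n_shift})
scores = cross_val_score(wsvm, X, y, cv=5, scoring="recall")
print("mean shift-pattern recall:", round(scores.mean(), 3))
```

Recall on the abnormal class is the relevant score here, since a missed fault pattern is far costlier than a false alarm.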
Weighted Relaxed Support Vector Machines
Classification of imbalanced data is challenging when outliers exist. In this paper, we propose a supervised learning method that simultaneously classifies imbalanced data and reduces the influence of outliers. The proposed method is a cost-sensitive extension of relaxed support vector machines (RSVM), in which the restricted penalty-free slack is split independently between the two classes, in proportion to the number of samples in each class and with different weights; hence the name weighted relaxed support vector machines (WRSVM). We compare the classification results of WRSVM with SVM, WSVM, and RSVM on public benchmark datasets with imbalanced classes and outlier noise, and show that WRSVM produces more accurate and robust classification results.
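A formulation consistent with the description above can be sketched as follows; the notation is reconstructed from the abstract and the standard soft-margin SVM, so details may differ from the paper. Here $I^{+}, I^{-}$ index the two classes, $\xi_i$ are the usual penalized slacks with class-dependent costs $C^{+}, C^{-}$, and $\eta_i$ are the penalty-free slacks with class-wise budgets $K^{+}, K^{-}$ set in proportion to the class sizes:

```latex
\begin{aligned}
\min_{w,\, b,\, \xi,\, \eta} \quad
  & \tfrac{1}{2}\lVert w \rVert^{2}
    + C^{+} \sum_{i \in I^{+}} \xi_i
    + C^{-} \sum_{i \in I^{-}} \xi_i \\
\text{s.t.} \quad
  & y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i - \eta_i,
    \qquad i \in I^{+} \cup I^{-}, \\
  & \sum_{i \in I^{+}} \eta_i \le K^{+}, \qquad
    \sum_{i \in I^{-}} \eta_i \le K^{-}, \\
  & \xi_i \ge 0, \quad \eta_i \ge 0,
    \qquad i \in I^{+} \cup I^{-}.
\end{aligned}
```

Outliers can absorb the budgeted slack $\eta_i$ without contributing to the objective, which is how the influence of extreme points is limited, while the class-dependent costs $C^{\pm}$ handle the imbalance.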
Constraint Relaxation, Cost-Sensitive Learning And Bagging For Imbalanced Classification Problems With Outliers
Supervised learning consists of developing models able to distinguish data that belong to different categories (classes). When the classes are present in different proportions, the problem becomes imbalanced and the performance of standard classification methods deteriorates significantly. Imbalanced classification becomes even more challenging in the presence of outliers. In this paper, we study several algorithmic modifications of the support vector machine classifier for tackling imbalanced problems with outliers. We provide computational evidence that the combined use of cost-sensitive learning and constraint relaxation performs better, on average, than algorithmic tweaks involving bagging, a popular approach for dealing with imbalanced problems or outliers separately. The proposed technique is embedded and requires the solution of a single convex optimization problem, with no outlier-detection preprocessing.
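The two families the abstract contrasts can be set side by side in a small benchmark: a bagging ensemble of SVMs versus a single cost-sensitive SVM, on data with both imbalance and label-noise outliers. This is a hedged comparison harness only; the constraint-relaxation term of the proposed method needs a custom convex program and is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# flip_y injects outlier-like label noise into an imbalanced problem.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)

bagged = BaggingClassifier(SVC(kernel="rbf"), n_estimators=15, random_state=0)
cost_sensitive = SVC(kernel="rbf", class_weight="balanced")

for name, clf in [("bagged SVM", bagged), ("cost-sensitive SVM", cost_sensitive)]:
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: minority F1 = {f1:.3f}")
```

Bagging averages away outlier influence across resampled fits, whereas cost-sensitive weighting reshapes a single decision boundary; the paper's point is that relaxation plus weighting in one convex problem beats treating the two issues separately.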